
Ensure single reader and writer to system fd on Unix#16209

Merged
straight-shoota merged 5 commits into crystal-lang:master from ysbaddaden:feature/add-crystal-fd-lock-with-refcount-and-serial-rw
Dec 1, 2025

Conversation

@ysbaddaden (Collaborator) commented Oct 14, 2025

This patch extends the fdlock to serialize reads and writes: the reference-counted lock gains a read lock and a write lock, so taking a reference and locking act as a single operation instead of two (1. acquire/release the lock; 2. take/return a reference). This avoids a race condition in the polling event loops:

  • Fiber 1 then Fiber 2 try to read from fd;
  • Since fd isn't ready, both fibers start waiting;
  • When fd becomes ready then Fiber 1 is resumed;
  • Fiber 1 doesn't read everything and returns;
  • Since events are edge-triggered, Fiber 2 won't be resumed!!!

With the read lock, fiber 2 will wait on the lock and be resumed by fiber 1 when it returns. A concrete example is multiple fibers accepting on the same socket: fiber 1 would keep handling connections while fiber 2 sits idle forever.

The other benefit is that it simplifies the event loops, which now only have to deal with a single reader and a single writer per IO. It is also required by the io_uring event loop (at least its MT version).
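The combined take-a-reference-and-lock operation can be sketched as follows. This is a minimal illustration only: the class name, fields, and methods are assumptions, not the actual `Crystal::FdLock` API.

```crystal
# Hypothetical sketch of a fd lock combining a reference count with
# serialized reads and writes. Illustrative only; not the real FdLock.
class FdLockSketch
  def initialize
    @refcount = Atomic(Int32).new(0)
    @read_mutex = Mutex.new
    @write_mutex = Mutex.new
  end

  # Taking a reference and acquiring the read lock happen as one
  # operation: only one fiber reads from the fd at any time, and the
  # fd can't be closed while the reference is held.
  def read
    @refcount.add(1)
    begin
      @read_mutex.synchronize { yield }
    ensure
      @refcount.sub(1)
    end
  end

  def write
    @refcount.add(1)
    begin
      @write_mutex.synchronize { yield }
    ensure
      @refcount.sub(1)
    end
  end
end
```

When fiber 1 returns from its read, releasing the read lock wakes fiber 2, which avoids the lost edge-triggered wakeup described above.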

NOTE: While this patch only serializes reads/writes on UNIX at the Crystal::System level, which is where the bugs are, we may want to move it into the stdlib for all targets at some point, for example to serialize reads and writes around IO::Buffered. See #16289 (comment)

Depends on #16288 and #16289.
Required by #16264.

@ysbaddaden (Collaborator, Author) commented:

I split the fdlock into two commits (refcount, then serial R/W) that outline the different steps, so they can be merged as individual PRs.

ysbaddaden added a commit to ysbaddaden/crystal that referenced this pull request Oct 18, 2025
ysbaddaden added a commit to ysbaddaden/crystal that referenced this pull request Oct 24, 2025
@ysbaddaden ysbaddaden mentioned this pull request Oct 24, 2025
@ysbaddaden ysbaddaden force-pushed the feature/add-crystal-fd-lock-with-refcount-and-serial-rw branch from 904dd95 to 6be2dd7 on October 28, 2025 16:16
@ysbaddaden ysbaddaden changed the title from "Fix: closing fd is thread unsafe on UNIX targets" to "UNIX: ensure single reader and writer to system fd" on Oct 28, 2025
@ysbaddaden ysbaddaden force-pushed the feature/add-crystal-fd-lock-with-refcount-and-serial-rw branch from 6be2dd7 to b92814a on October 30, 2025 17:39
Serializes reads and writes so we can assume any IO object will only
have at most one read op and one write op. The benefits are:

1. it avoids a race condition in the polling event loops:

   - Fiber 1 then Fiber 2 try to read from fd;
   - Since fd isn't ready, both are waiting;
   - When fd becomes ready then Fiber 1 is resumed;
   - Fiber 1 doesn't read everything and returns;
   - Fiber 2 won't be resumed because events are edge-triggered;

2. we can simplify the UNIX event loops (epoll, kqueue, io_uring) that
   are guaranteed to only have at most one reader and one writer at any
   time.
@ysbaddaden ysbaddaden force-pushed the feature/add-crystal-fd-lock-with-refcount-and-serial-rw branch from b92814a to 72507a7 on November 27, 2025 15:12
@ysbaddaden ysbaddaden marked this pull request as ready for review November 27, 2025 15:13
@ysbaddaden (Collaborator, Author) commented:

Rebased from master to bring #16288 and #16289 that this patch depends on.

@ysbaddaden (Collaborator, Author) commented Nov 27, 2025

Usages are still restricted to Crystal::System types on UNIX because moving it out requires implementing IOCP#shutdown, which itself needs the single reader and single writer locks that this PR brings (a chicken-and-egg issue 🐣), for reasons explained in #16289 (comment).

I'll prepare a third PR that will:

  • Implement Crystal::EventLoop::IOCP#shutdown(Socket);
  • Implement Crystal::EventLoop::IOCP#shutdown(IO::FileDescriptor);
  • Move @fd_lock to IO::FileDescriptor and Socket.

@straight-shoota (Member) left a comment:

🎉

@straight-shoota straight-shoota added this to the 1.19.0 milestone Nov 27, 2025
@straight-shoota straight-shoota merged commit 7bdbd04 into crystal-lang:master Dec 1, 2025
49 checks passed
@github-project-automation github-project-automation bot moved this from Review to Done in Multi-threading Dec 1, 2025
@straight-shoota straight-shoota changed the title from "UNIX: ensure single reader and writer to system fd" to "Ensure single reader and writer to system fd on Unix" on Dec 1, 2025
@ysbaddaden ysbaddaden deleted the feature/add-crystal-fd-lock-with-refcount-and-serial-rw branch December 1, 2025 11:58
straight-shoota pushed a commit that referenced this pull request Apr 14, 2026
Implements an event loop that leverages **io_uring** on Linux targets.

### Requirements

The event loop relies on features added across several kernel versions. At a minimum Linux 5.19 is required, and the recent Linux 6.13 is recommended. It is thus compatible with the Linux 6.1 SLTS kernel but not with earlier (S)LTS kernels.

The io_uring event loop is disabled by default. It must be enabled manually at compile time with the `-Devloop=io_uring` flag.

The SQPOLL feature is supported but disabled by default. It avoids syscalls on submissions & completions, which is very cool... but it [uses _lots_ of CPU](https://unixism.net/loti/tutorial/sq_poll.html) 🔥. It can be enabled at compile time via the `IORING_SQ_THREAD_IDLE` environment variable (in milliseconds), which sets the idle time for the SQPOLL thread.

For example:

```sh
export IORING_SQ_THREAD_IDLE=200
crystal build app.cr -Devloop=io_uring
```

### Implementation details

The basic implementation was straightforward. It's basically an async framework: submit an operation, suspend the fiber, and resume it when the operation has completed.
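The control flow can be illustrated roughly like this. The `Ring#submit_read` API shown here is invented for the sketch (the real event loop resumes the fiber directly from the completion queue); a `Channel` stands in for the wakeup mechanism.

```crystal
# Rough sketch of the submit/suspend/resume pattern. The Ring API is
# hypothetical; only the control flow matters.
def async_read(ring, fd, slice)
  completion = Channel(Int32).new(1)

  # 1. submit: queue a read SQE and remember how to resume the caller
  ring.submit_read(fd, slice) do |result|
    # 3. resume: invoked by the event loop when the CQE is reaped
    completion.send(result)
  end

  # 2. suspend: the calling fiber blocks here until the op completes
  completion.receive
end
```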

This is also the second event loop that uses blocking IO after IOCP on Windows, and the first one on UNIX.

The main issue is a Linux limitation: close doesn't interrupt operations pending in the kernel, so we must, for example, shut down sockets and cancel pending ops on files.

### Threads Support & Safety

The MT safe implementation (preview_mt, execution_context) was much more complex. Unlike the other event loops, we can't have a single ring: it would require locking on every submit, which with multiple threads would create contention and would likely require syscalls (defeating the point). So we need one ring per thread (sharing the same kernel resources).

There's thus a new API to register execution context schedulers to the event loop, so we can create/close rings as needed. Since a scheduler can shut down (e.g. after a resize down), the execution context must also drain its ring before the scheduler can stop: all the pending operations must have completed and all the pending fibers must be enqueued.
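A drain step could look roughly like this (entirely hypothetical names; the real scheduler/ring APIs differ):

```crystal
# Hypothetical drain loop: before a scheduler stops, wait for every
# pending operation on its ring to complete, enqueueing the woken
# fibers so another scheduler can run them.
def drain(ring)
  until ring.pending_operations.zero?
    ring.wait_for_completion do |fiber|
      fiber.enqueue
    end
  end
  ring.close
end
```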

We need cross-ring communication for a couple of scenarios: to interrupt a thread waiting on the event loop, and to cancel pending read/write file operations (the serial R/W of #16209 is required). At worst, this communication needs a lock on submit (which is avoided on Linux 6.13+). Unlike the single-ring design, this lock should usually not be contended in practice (unless you open lots of files, read/write from many fibers to the same file, and close from whatever fiber).

Unlike the other event loops, there isn't a single system instance for the whole event loop (e.g. one epoll, kqueue or IOCP), and each scheduler is responsible for its own completion queue... which means we're back to the situation where a busy thread can hold runnable fibers in its completion queue while other threads starve. A busy thread can be running a CPU-bound fiber, or a pair of fibers that keep re-enqueuing each other.

To avoid this situation, once in a while, and every time a scheduler would otherwise wait on the event loop (i.e. it's starving), the event loop will instead iterate the completion rings and try to steal runnable fibers from other threads. That requires a lock on each completion queue, which should also usually not be contended (it's only taken once in a while).
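The stealing pass might be sketched as follows (illustrative only; the names and structure are assumptions, not the actual implementation):

```crystal
# Hypothetical sketch: instead of blocking on its own ring, a starving
# scheduler walks the other rings' completion queues (under their locks)
# and steals the first runnable fiber it finds.
def steal_runnable(rings, own_ring)
  rings.each do |ring|
    next if ring == own_ring
    ring.completion_lock.synchronize do
      if fiber = ring.pop_runnable?
        return fiber
      end
    end
  end
  nil
end
```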